Diagnosing a Failed Proof in 
Fault-Tolerance: A Disproving 
Challenge Problem* 

Lee Pike 1, Paul Miner 2, and Wilfredo
Torres-Pomales 2

1 Galois Connections 
Beaverton, Oregon, USA 
leepike@galois.com
2 NASA Langley Research Center 
Hampton, VA, USA 

{p.s.miner, w.torres-pomales}@larc.nasa.gov


Abstract. This paper proposes a chal- 
lenge problem in disproving. We de- 
scribe a fault-tolerant distributed pro- 
tocol designed at NASA for use in a 
fly-by-wire system for next-generation 
commercial aircraft. An early design of 
the protocol contains a subtle bug that 
is highly unlikely to be caught in fault- 
injection testing. We describe a failed 
proof of the protocol’s correctness in a 
mechanical theorem prover (PVS) with 
a complex unfinished proof conjecture. 

We use a model checking suite (SAL) 
to generate a concrete counterexample 
to the unproven conjecture to demon- 
strate the existence of a bug. However, 
we argue that the effort required in our 
approach is too high and propose what 
conditions a better solution would sat- 
isfy. We carefully describe the protocol 
and bug to provide a challenging but 
feasible case study for disproving re- 
search. 

1 Introduction 

Although rarely discussed in the archival litera- 
ture, many attempts to prove conjectures using 
interactive mechanical theorem proving fail. Provided
the theorem prover is sound and the conjecture
is not both true and unprovable (a possibility
in mathematics), there are two possible reasons
for a failed proof attempt. First, the conjecture
may be true, but the user lacks the resources or
insight to prove it. Second, the conjecture may 
be false. It can be difficult to determine which of 
these is the case. 

When mathematicians cannot complete a proof 
of a conjecture, they begin to seek a counterex- 
ample to it. Mechanical theorem proving can ex- 
acerbate this difficult task. The difficulty is partly 
due to theorem provers often being used to rea- 
son about algorithms and protocols. Proofs of cor- 
rectness in this domain often involve nested case- 
analysis. A proof obligation that cannot be com- 
pleted is often deep within the proof, where intu- 
ition about the system behavior - and what would 
constitute a counterexample - wanes. The diffi- 
culty is also due to the nature of mechanical theo- 
rem proving. The proof steps issued in such a sys- 
tem are fine-grained. Formal specifications make 
explicit much of the detail that is suppressed in 
informal models. The detail and formality of the 
specification and proof makes the discovery of a 
counterexample more difficult. 

We present a case study that highlights this dif- 
ficulty. We describe the formal verification of a 
distributed fault-tolerant protocol in the mechanical
theorem prover PVS [2]. A conjecture about
the protocol is partially verified by case-analysis, 
leaving a single unproven case. The case involves 
a complex set of fault statuses and system invari- 
ants. 

In particular, the protocol investigated is an 
interactive consistency protocol for use in the 
Reliable Optical Bus (ROBUS), a state-of-the- 
art ultra-reliable communications bus under de- 
velopment at the NASA Langley Research Cen- 
ter and the National Institute of Aerospace. 
It is being developed as part of the Scalable 
Processor-Independent Design for Extended Reliability
(SPIDER) project [3,4]. SPIDER is a family
of ultra-reliable architectures built upon the
ROBUS. Currently, ROBUS implementations include
an FPGA-based prototype.

The counterexample was initially discovered by 
the third author through “engineering insight.” 
The bug in the protocol design occurs only when 
there are two simultaneous Byzantine faults [5, 6]. 
As described in greater detail later in the paper, 
the bug arises from the interaction between the 
system’s fault assumptions and the local diagnoses 
made by nodes in the system. Local diagnoses are 
used in a fault-tolerant system to increase reliabil- 
ity and to maintain group membership, a group of 
mutually-trusted non-faulty nodes [7]. In a sense,
the bug is due to the interplay of system operation 
(i.e., executing the protocol) and system survival
(i.e., maintaining group membership). These concerns
apply to a variety of fault-tolerant embedded
systems [8].

The protocol is designed to tolerate such a fault 
scenario. However, the subtlety of the scenario 
means it is extremely unlikely the bug would be 
caught during fault-injection testing [9]. Neverthe- 
less, safety-critical systems like SPIDER that are 
designed for use in commercial aircraft must have 
a failure rate no higher than lCU n to 10 -12 per 
hour of operation [8,10]. A design error that es- 
capes testing could adversely affect a system’s re- 
liability. We believe that if the bug had not been 
caught by insightful inspection, the only other way 
it would be caught is through formal analysis. 

In the paper, we describe our approach to formally
uncover the bug (see footnote 3). Specifically, we model the
failed proof obligation in a model checker, and at- 
tempt to prove it holds in a model in which param- 
eters have been interpreted with small constants. 
Using the counterexample generated by the model 
checker, one can quickly determine that the proto- 
col is incorrect. Furthermore, the counterexample 
suggests the appropriate modification to correct 
the bug. 

Motivation Unfortunately, we feel our approach is 
inadequate for the following reasons: 

— The approach is too interactive and onerous. 
It requires manually specifying the protocol 
and failed conjecture in a model checker and 
manually correcting the specification in the 
theorem prover. 

— The approach depends on the counterexample 
arising by instantiating the parameters with 
small finite values. 

— Indeed, we would like a more automated ap- 
proach to verify the parameterized protocol 
specification in the first place than is possible 
using mechanical theorem proving alone. 

Therefore, we offer this case study as a challenge 
problem to the disproving community. We believe 
researchers will find this problem of interest for 
the following reasons: 

— The protocol is industrially-relevant, and the 
bug is genuine. 

— The protocol can be described in English in 
just a few paragraphs (as is done in this pa- 
per), but the behavior of the protocol itself is 
subtle. 

3 The associated files can be retrieved at
(http://www.cs.indiana.edu/~lepike/pub_pages/unproven.html).


— While we believe our approach is unsatisfac- 
tory, we also believe it approximates the best 
contemporary approach for disproving prob- 
lems like the one presented (a purpose of this 
paper is to solicit evidence to the contrary). 

— Formal specifications of the protocol in both 
a mechanical theorem prover and a model 
checker accompany this paper for reference. 

Organization We describe the ROBUS Interactive 
Consistency (IC) Protocol as well as the architec- 
ture on which it is intended to execute in Sec- 
tion 2. In Section 3, we describe the kinds and 
number of faults under which the ROBUS IC Pro- 
tocol should correctly execute. Section 4 states 
the correctness requirements for the protocol as 
well as the state invariants that must hold for 
the ROBUS IC Protocol to satisfy them. In Sec- 
tion 5, we informally describe the counterexam- 
ple, discuss its origins, and provide a “fix” for it. 
In Section 6 we describe the conjecture attempted 
in PVS and then our generation of a counterex- 
ample using the Symbolic Analysis Laboratory 
(SAL) [11]. Section 7 outlines the specific chal- 
lenge and describes some metrics for success. Con- 
cluding remarks are in Section 8. 

2 A ROBUS IC Protocol 

We begin by describing the background of the 
family of protocols from which the ROBUS IC 
Protocol comes. Then after describing the archi- 
tecture of the ROBUS, we describe the protocol’s 
behavior itself. 

Background Protocols like the one described in 
this paper are fault-tolerant consensus algorithms 
and are known as “interactive consistency” or 
“oral messages” protocols. The protocol presented 
here is based on a protocol designed by Davies and 
Wakerly [12]. Lynch’s textbook provides an introduction
to these sorts of protocols as well as pointers
into the literature [13]. Many of these protocols
have been formally verified, both by theorem
proving [14-16] and by model checking [17]. 

Architecture The architecture of the ROBUS is 
a fully-connected bipartite graph of two sets of 
nodes, Bus Interface Units (BIUs) and Redun- 
dancy Management Units (RMUs). BIUs provide 
the interface between the bus and hosts running 
applications that communicate over the bus. The 
RMUs provide redundancy. The architecture for 
the special case of three BIUs and three RMUs is 
shown in Figure 1. There must be a minimum of
one BIU and one RMU, and there need not be an
equal number of BIUs and RMUs.



Fig. 1: The ROBUS Architecture 


Diagnostic Data Understanding the protocol be- 
havior requires a preliminary understanding of the 
diagnostic data collected by nodes. The protocol 
has a greater chance of succeeding if good nodes 
ignore faulty ones. Consequently, nodes maintain 
diagnoses against other nodes. These diagnoses re- 
sult from mechanisms to monitor the messages re- 
ceived during protocol execution. Diagnostic data 
is accumulated over multiple protocol executions. 

Each node maintains a diagnostic function as- 
signing each node (including itself) to one of 
the following three classifications: trusted, accused, 
and declared. Every (non-faulty) node assigns ev- 
ery other node to exactly one class. We call the 
node being labeled the defendant. If a node la- 
bels a defendant as trusted, then the node has 
insufficient evidence that the defendant is faulty. 
If it labels a defendant as accused, then it has lo- 
cal evidence that the defendant is faulty, but does 
not know whether other good nodes have similar 
evidence. Once a defendant is declared, all good 
nodes know that they share the declaration. 

Periodically, the RMUs and BIUs execute a Distributed
Diagnosis Protocol in which the nodes
submit the diagnoses accumulated thus far [7]. If 
enough good nodes have accused a defendant, then 
the defendant is declared. The Distributed Diag- 
nosis Protocol ensures that all good nodes agree 
on which nodes have been declared. 

2.1 Protocol Description 

We distinguish one BIU as the General. The 
ROBUS IC Protocol is a synchronous protocol de- 
signed to reliably transmit the General’s message 
despite faults in the system (the formal require- 
ments are provided in Section 4). In the follow- 
ing, a benign message is one that all nonfaulty 
nodes can detect came from a faulty node (see 
Section 3). The ROBUS IC Protocol is as follows 


(the message-passing events of the protocol are il- 
lustrated in Figure 2): 

1. The General, G, broadcasts its message, v, to 
all RMUs.

2. For each RMU, if it receives a benign message
from G, then it broadcasts the special message 
source error to all BIUs. Otherwise it relays 
the message it received. 

3. For each BIU b, if b has declared G, then b
outputs the special message source error. Otherwise,
if b received a benign message from an
RMU, then that RMU is accused. Then b performs
a majority vote over the values received from
those RMUs it trusts. If no majority exists,
source error is the result; otherwise, the majority
value is the result.
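
As an illustration, steps 2 and 3 can be sketched in Python. This is a minimal sketch, not the PVS or SAL model: the names `SOURCE_ERROR`, `BENIGN`, `rmu_relay`, and `biu_output` are hypothetical, a benign message is modeled by a sentinel value, and "majority" is assumed to mean a strict majority of the values received from trusted RMUs.

```python
from collections import Counter

SOURCE_ERROR = "source_error"   # the special protocol message
BENIGN = None                   # hypothetical stand-in for a detectably faulty message


def rmu_relay(received):
    """Step 2 for a good RMU: relay the message, or source_error if it was benign."""
    return SOURCE_ERROR if received is BENIGN else received


def biu_output(diagnosis_of_general, trusted_rmus, received):
    """Step 3 for one good BIU.

    diagnosis_of_general: "trusted", "accused", or "declared"
    trusted_rmus: set of RMU ids the BIU trusts before this round
    received: dict mapping RMU id -> message received from that RMU
    """
    if diagnosis_of_general == "declared":
        return SOURCE_ERROR
    # An RMU that sent a benign message is accused, so it no longer votes.
    voters = [r for r in trusted_rmus if received[r] is not BENIGN]
    if not voters:
        return SOURCE_ERROR
    value, count = Counter(received[r] for r in voters).most_common(1)[0]
    # Assumption: a majority is strictly more than half of the votes cast.
    return value if 2 * count > len(voters) else SOURCE_ERROR


# Fault-free run: every RMU relays the General's message v.
relayed = {r: rmu_relay("v") for r in (1, 2, 3)}
assert biu_output("trusted", {1, 2, 3}, relayed) == "v"
```

Separating the relay rule from the vote keeps the sketch close to the two rounds of message passing, so a change to step 2 alone can be tried without touching the vote.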

3 Faults 

Fault Classifications Faults result from innu- 
merable occurrences including physical damage, 
electromagnetic interference, and “slightly-out-of- 
spec” communication [5]. We collect these fault 
occurrences into fault types according to their ef- 
fects in the system. 

We adopt the hybrid fault model of Thambidu- 
rai and Park [18]. All non-faulty nodes are also 
said to be good. A node is called benign, or man- 
ifest, if it sends only benign messages. Benign 
messages abstract various sorts of misbehavior. A 
message that is sufficiently garbled during trans- 
mission may be caught by an error-checking code 
and deemed benign. In synchronized systems with 
global communication schedules, they also ab- 
stract messages not sent (i.e., a message is ex- 
pected by a receiver but is absent on a commu- 
nication channel) or messages that arrive at un- 
scheduled times. A node is called symmetric if it 
sends every receiver the same message, but these 
messages may be incorrect. A node is called asym- 
metric, or Byzantine [6], if it arbitrarily sends dif- 
ferent messages to different receivers. 

Fault Assumption A fault-tolerant protocol is de- 
signed to tolerate a certain number of faults of 
each fault type. For a protocol, this is specified by 
its maximum fault assumption (MFA). A proof of
correctness of a protocol is of the form, “If the
MFA holds, then the protocol satisfies property
P,” where P is a correctness condition for the protocol.
The probability that an MFA holds is determined
by reliability analysis [19].

We call the MFA for the ROBUS IC Protocol 
the Interactive Consistency Dynamic Maximum
Fault Assumption (IC DMFA). ‘Dynamic’ emphasizes
that the fault assumption is parameterized
by the local diagnoses of nodes, which change over 
time. 

Definition 1 (IC DMFA). Let GB, SB, and AB
denote the sets of BIUs that are good,
symmetrically-faulty, and asymmetrically-faulty,
respectively. Let GR, SR, and AR represent the
corresponding sets of RMUs, respectively. For
good BIU b, let $T_b$ denote the set of RMUs b trusts.
This is b's trusted set. Define $T_r$ similarly: it is
the set of BIUs that RMU r trusts. The following
formulas together make up the IC DMFA. G is the
General. For all good BIUs b and good RMUs r,

1. $|GR \cap T_b| > |SR \cap T_b| + |AR \cap T_b|$;

2. $G \in AB \cap T_r$ implies $|AR \cap T_b| = 0$.

The first clause ensures that the trusted set $T_b$ of
a good BIU b contains strictly more good RMUs than
symmetrically-faulty and asymmetrically-faulty
RMUs combined. The second clause ensures that either
no good RMU r trusts an asymmetrically-faulty
General, or no good BIU b trusts an
asymmetrically-faulty RMU.
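
The two clauses can be read as a predicate over finite sets. The following Python sketch is only an illustration under stated assumptions: the function name `ic_dmfa` and the dictionary encoding of fault statuses and trusted sets are hypothetical and do not come from the PVS theories.

```python
def ic_dmfa(general, status_b, status_r, trusts_b, trusts_r):
    """Check both IC DMFA clauses for every good BIU b and good RMU r.

    status_b / status_r: dict node id -> "good" | "benign" | "symmetric" | "asymmetric"
    trusts_b[b]: set of RMUs that good BIU b trusts (its trusted set T_b)
    trusts_r[r]: set of BIUs that good RMU r trusts (its trusted set T_r)
    """
    def members(status, fault_class):
        return {n for n, s in status.items() if s == fault_class}

    GR = members(status_r, "good")
    SR = members(status_r, "symmetric")
    AR = members(status_r, "asymmetric")
    AB = members(status_b, "asymmetric")

    for b in members(status_b, "good"):
        T_b = trusts_b[b]
        # Clause 1: trusted good RMUs strictly outnumber trusted faulty ones.
        if not len(GR & T_b) > len(SR & T_b) + len(AR & T_b):
            return False
        for r in GR:
            # Clause 2: if r trusts an asymmetric General, b trusts no asymmetric RMU.
            if general in AB and general in trusts_r[r] and len(AR & T_b) != 0:
                return False
    return True


# Hypothetical instance: asymmetric General and RMU r1, no good RMU trusts G.
status_b = {"G": "asymmetric", "b1": "good", "b2": "good"}
status_r = {"r1": "asymmetric", "r2": "good", "r3": "good"}
trusts_b = {"b1": {"r1", "r2", "r3"}, "b2": {"r1", "r2", "r3"}}
trusts_r = {"r2": {"b1", "b2"}, "r3": {"b1", "b2"}}
assert ic_dmfa("G", status_b, status_r, trusts_b, trusts_r)
```

If a good RMU in this instance also trusted G, clause 2 would fail because both BIUs trust the asymmetric RMU r1.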

4 The ROBUS IC Protocol 
Correctness 

We begin by stating the requirements for the 
ROBUS IC Protocol. We then state invariants 
that must hold in a system executing the ROBUS 
IC Protocol in order for it to meet these require- 
ments. 

Requirements Two requirements must hold. 

Definition 2 (Agreement). All good BIUs 
compute the same value. 

Definition 3 (Validity). If the General is good 
and broadcasts message v, then the value computed 
by a good BIU is v. 

Diagnostic Assumptions In addition to constrain- 
ing the number of and kind of faults, the cor- 
rectness of the ROBUS IC Protocol depends on 
the diagnostic mechanisms satisfying certain constraints.
Let $b_1$ and $b_2$ be good BIUs, let $r_1$ be a
good RMU, and let n be either a BIU or RMU of
any fault classification.

Definition 4 (Good Trusted). $b_1$ trusts n if n
is good.

Definition 5 (Symmetric Agreement). If n is not
asymmetrically-faulty, $b_1$ accuses n if and
only if $b_2$ accuses n.

Definition 6 (Conviction Agreement). $b_1$
declares n if and only if $r_1$ declares n.

These properties similarly hold for any two good 
RMUs with respect to a defendant n. 

Intuitively, Good Trusted ensures that diagnos- 
tic mechanisms never lead a good node to ac- 
cuse another good node. Symmetric Agreement 
ensures that all good nodes that receive the same 
data make the same diagnosis. Note, however, that 
Symmetric Agreement allows a good BIU and a 
good RMU to make different diagnoses about a 
node that is asymmetrically-faulty. Finally, Con- 
viction Agreement is a correctness requirement of 
the Distributed Diagnosis Protocol [7], and it is 
a precondition for the correctness of the protocol 
under investigation in this paper. Together, these 
three assumptions are called the Diagnostic As- 
sumptions. 
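
A small Python sketch may help make the three assumptions concrete. The encoding is hypothetical (the function `diagnostic_assumptions` and the dictionaries `status`, `kind`, and `diag` are not part of the PVS specification), and it assumes one particular reading of the definitions: Conviction Agreement is checked across every pair of good nodes, while Symmetric Agreement is checked only between good nodes of the same kind.

```python
def diagnostic_assumptions(status, kind, diag):
    """Check Good Trusted, Symmetric Agreement, and Conviction Agreement.

    status[n]:  fault class of node n ("good", "benign", "symmetric", "asymmetric")
    kind[n]:    "BIU" or "RMU"
    diag[o][n]: diagnosis held by good observer o against defendant n
                ("trusted", "accused", or "declared")
    """
    good = [n for n in status if status[n] == "good"]
    for o in good:
        for n in status:
            # Good Trusted: a good observer trusts every good defendant.
            if status[n] == "good" and diag[o][n] != "trusted":
                return False
    for o1 in good:
        for o2 in good:
            for n in status:
                # Conviction Agreement: declarations are shared by all good nodes.
                if (diag[o1][n] == "declared") != (diag[o2][n] == "declared"):
                    return False
                # Symmetric Agreement: same-kind good observers agree on accusations
                # unless the defendant is asymmetrically-faulty.
                if (kind[o1] == kind[o2] and status[n] != "asymmetric"
                        and (diag[o1][n] == "accused") != (diag[o2][n] == "accused")):
                    return False
    return True


# Hypothetical instance: G and r1 are asymmetric; good RMUs accuse G.
nodes = ["G", "b1", "b2", "r1", "r2", "r3"]
status = {"G": "asymmetric", "b1": "good", "b2": "good",
          "r1": "asymmetric", "r2": "good", "r3": "good"}
kind = {n: ("RMU" if n.startswith("r") else "BIU") for n in nodes}
diag = {o: {n: "trusted" for n in nodes} for o in ("b1", "b2", "r2", "r3")}
diag["r2"]["G"] = "accused"
diag["r3"]["G"] = "accused"
assert diagnostic_assumptions(status, kind, diag)
```

Note that the instance passes even though BIUs and RMUs disagree about G, because the defendant is asymmetrically-faulty and nobody has declared it.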

5 The Counterexample 

We describe the counterexample informally and 
briefly describe its origins. We then describe a pro- 
tocol that does not suffer from the flaw. 

Counterexample The following instance of the 
ROBUS IC Protocol violates Agreement. Consider
an architecture containing three BIUs, G,
$b_1$, and $b_2$, and three RMUs, $r_1$, $r_2$, and $r_3$. Let
the General be asymmetrically-faulty. Let RMU
$r_1$ be asymmetrically-faulty, too, and let all other
nodes be good. Suppose $b_1$ and $b_2$ either accuse
or trust G (it does not matter which), and they
trust all RMUs. Furthermore, suppose $b_1$ and $b_2$
trust $r_1$, but no good RMU trusts G. These hypotheses
satisfy the IC DMFA and the Diagnostic
Assumptions. Agreement is violated if the following
instance of the ROBUS IC Protocol transpires,
as illustrated in Figure 2.

1. G sends message v to $r_1$ and $r_2$, and it sends
message u to $r_3$, where $v \neq u$.

2. $r_1$ sends message v to $b_1$ and u to $b_2$. $r_2$ sends
message v to both $b_1$ and $b_2$. $r_3$ sends message
u to both $b_1$ and $b_2$.

3. $b_1$ outputs v whereas $b_2$ outputs u.
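
The instance can be replayed mechanically. The Python sketch below hard-codes the messages of steps 1 and 2 and applies a strict-majority vote in step 3; the helper `majority` and the dictionary names are illustrative, and the strict-majority rule is an assumption (no tie arises in this instance).

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or "source_error" if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else "source_error"


# Step 1: the asymmetric General sends v to r1 and r2, and u to r3.
from_general = {"r1": "v", "r2": "v", "r3": "u"}

# Step 2: good RMUs r2 and r3 relay what they received to every BIU;
# the asymmetric RMU r1 sends v to b1 but u to b2.
to_biu = {
    "b1": {"r1": "v", "r2": from_general["r2"], "r3": from_general["r3"]},
    "b2": {"r1": "u", "r2": from_general["r2"], "r3": from_general["r3"]},
}

# Step 3: each good BIU trusts all three RMUs and votes over what it received.
outputs = {b: majority(list(msgs.values())) for b, msgs in to_biu.items()}
print(outputs)  # {'b1': 'v', 'b2': 'u'}
```

The two good RMUs split one vote each way, so the asymmetric RMU casts the deciding vote differently at each BIU.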

Origins of the Flaw The flaw in the ROBUS IC 
Protocol was introduced when an earlier version 
of the protocol was amended to allow for the rein- 
tegration of transiently-faulty nodes [20]. A node 
becomes transiently-faulty when its state is dis- 
rupted (due, e.g., to exposure to high-intensity 
radiation), but the node is not permanently damaged.
A node that suffers a transient fault has the
potential to reintegrate with the good nodes in the
system by restoring consistent state with them.

Fig. 2: An Instance of the ROBUS IC Protocol
Violating Agreement

In the earlier version of the ROBUS IC Pro- 
tocol, an RMU would only relay a message from 
the General if it trusted the General. Otherwise, 
the source error message was relayed. To allow 
for reintegration, the messages from a previously- 
declared General needed to be relayed by RMUs 
so that the BIUs can determine whether it is fit 
for reintegration. However, the flaw in the ROBUS 
IC Protocol arose when the earlier protocol was 
changed so that RMUs relayed the message from 
the General regardless of its diagnostic status, so 
long as it did not send a benign message. 

A New ROBUS IC Protocol In retrospect, a fix to
the protocol is simple. Step 2 of the protocol de- 
scription in Section 2.1 is changed so that an RMU 
r relays the message source error if it receives a 
benign message or if r accuses the General. If the 
General is declared, its message is relayed to allow 
BIUs to gather diagnostic data about the General. 
An accused General implies that the General re- 
cently suffered a fault (assuming the accuser is 
good), so there is no need to relay its message 
for reintegration purposes. The correctness of the 
protocol, proved in PVS, is described by Miner et 
al. [21].
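
The effect of the amended step 2 on the scenario of this section can be checked with a short Python sketch. It is an illustration only: the function names are hypothetical, the vote is assumed to be a strict majority, and the good RMUs $r_2$ and $r_3$ are assumed to accuse (rather than declare) the General, which is one way to satisfy "no good RMU trusts G" when the BIUs have not declared G.

```python
from collections import Counter
from itertools import product

SOURCE_ERROR = "source_error"


def majority(values):
    """Return the strict-majority value, or source_error if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else SOURCE_ERROR


def good_rmu_relay(received, diagnosis_of_general, fixed):
    """Step 2 for a good RMU; `fixed` selects the amended rule."""
    if fixed and diagnosis_of_general == "accused":
        return SOURCE_ERROR
    return received


def run(fixed, r1_to_b1, r1_to_b2):
    """Replay the scenario: r1 is asymmetric, r2 and r3 are good and accuse G."""
    r2 = good_rmu_relay("v", "accused", fixed)
    r3 = good_rmu_relay("u", "accused", fixed)
    return majority([r1_to_b1, r2, r3]), majority([r1_to_b2, r2, r3])


# Under the original rule the scenario violates Agreement ...
assert run(False, "v", "u") == ("v", "u")
# ... while under the amended rule no choice of r1's messages can.
for m1, m2 in product(["v", "u", SOURCE_ERROR], repeat=2):
    out_b1, out_b2 = run(True, m1, m2)
    assert out_b1 == out_b2 == SOURCE_ERROR
```

This checks only the one scenario with three RMUs; the general correctness argument is the PVS proof cited above.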

6 Formally Deriving the 
Counterexample 

In this section, we describe the unfinished proof 
obligation generated in our attempt to formally 
prove a conjecture about the ROBUS IC Protocol. 
We then describe our use of a model checker to 
derive a counterexample to the conjecture. 

6.1 Generating the Proof Obligation 

In our approach, we use the PVS theorem prov- 
ing system developed by SRI International [2]. 


We have used PVS to specify and verify other 
ROBUS protocols [21,7]. The specification lan- 
guage of PVS is a strongly-typed higher-order 
logic, and the proof system is the classical sequent 
calculus. 

Various details about the construction of the 
underlying theories used to model the algorithm 
and ROBUS are irrelevant (see footnote 4). A general discussion
of the abstractions used in the model is provided 
elsewhere [22]. The following notation is used in 
the formal statements of the Agreement Conjec- 
ture and the unproved sequent in Figure 3 and 
Figure 4. 

Variables and Parameters The parameters B and
R are uninterpreted natural numbers. The sets of
BIUs and RMUs are indexed from 0 to B - 1 and
0 to R - 1, and these sets of indices are denoted
below(B) and below(R), respectively. Thus, the
PVS specification is parameterized by the number
of BIUs and RMUs, and a proof of correctness
holds for any instantiation of these parameters
that satisfies the hypotheses of the proof. Let
b1, b2, G ∈ below(B), where G is used to designate
the General. F is a variable over the set of records
(i.e., named tuples [23]) that contain all of the
diagnoses in the system. The ` operator provides
record access. Thus, F`RB denotes the collection of
the BIUs’ diagnoses against the RMUs, F`BB denotes
the BIUs’ diagnoses against the BIUs, and
similarly for F`BR and F`RR. F`RB(b1)(r) denotes
b1’s diagnosis of r, and similarly for the other
diagnoses. F`RB(b1)(r) yields a value from the
set {trusted, accused, declared}. The function
b_status maps BIUs to a fault class (one of good,
benign, symmetric, and asymmetric), and similarly,
r_status maps RMUs to a fault class. Finally, msg
is an arbitrary message being broadcast by the General.

Functions and Relations The following functions 
and relations appear in the sequent. 

— good? is a predicate that takes the fault status 
of a node and is true if the status is good. 
benign?, symmetric?, and asymmetric? are 
similarly defined. 

— IC_DMFA is a formal statement of the IC DMFA 
described in Section 3. 

— all_correct_accs? is a predicate formally 
stating the Diagnostic Assumptions defined in 
Section 4. 

4 The PVS models are more abstract than needed to
model this protocol since many of the same theories
are generalized to model other ROBUS protocols
[21].

— declared? is a predicate that takes the diagnosis
made by one node against a defendant
node and is true if the defendant is declared. 
Similarly, trusted? is true if the defendant is 
trusted. 

— robus_ic is a higher-order function that func- 
tionally models the ROBUS IC Protocol, as 
described in Section 2.1. It takes as arguments 
the fault statuses of the BIUs and RMUs, the 
diagnoses a BIU makes of G, as well as the set 
of its other diagnoses. It returns another func- 
tion that takes the General’s identifier, the 
message it sends, and a BIU identifier. The 
function returns the message the BIU outputs 
after the execution of the ROBUS IC Proto- 
col. The function is essentially the composi- 
tion of two functions modeling the two rounds 
of message passing (recall that we are modeling
the protocol in the synchronous domain,
so the rounds of message passing are the granularity
at which time is modeled).

It is the convention of PVS to denote skolem constants
with a trailing “!n,” where n is some integer.

The Sequent The conjecture to be proved is shown
in Figure 3. Assuming that b1 and b2 are both
good, that the Diagnostic Assumptions hold, and
that the IC DMFA holds, we attempt to prove that
the result of robus_ic is the same when applied
to b1 and b2.


Agreement: CONJECTURE
  good?(b_status(b1)) AND
  good?(b_status(b2)) AND
  all_correct_accs?(b_status, r_status, F) AND
  IC_DMFA(b_status, r_status, F)
  =>
  robus_ic(b_status, r_status, F`BB(b1)(G), F`RB(b1))
    (G, msg, b1) =
  robus_ic(b_status, r_status, F`BB(b2)(G), F`RB(b2))
    (G, msg, b2)

Fig. 3: The Agreement Conjecture in PVS


Every branch of the conjecture in Figure 3 is 
discharged except for the branch ending in the sin- 
gle sequent in Figure 4 (irrelevant formulas have 
been omitted). PVS labels the formulas in the an- 
tecedent with negative integers, while those in the 
consequent are labeled with positive integers. 

[-1]  good?(r_status!1(r!1))
[-2]  asymmetric?(b_status!1(G!1))
[-3]  IC_DMFA(b_status!1, r_status!1, F!1)
[-4]  all_correct_accs?(b_status!1, r_status!1, F!1)
  |-------
[1]   trusted?(F!1`BR(r!1)(G!1))
[2]   declared?(F!1`BB(b2!1)(G!1))
{3}   (FORALL (p_1: below(R)):
         (trusted?(F!1`RB(b1!1)(p_1)) =>
           NOT asymmetric?(r_status!1(p_1))))
       &
      (FORALL (p_1: below(R)):
         (trusted?(F!1`RB(b2!1)(p_1)) =>
           NOT asymmetric?(r_status!1(p_1))))
[4]   declared?(F!1`BB(b1!1)(G!1))
[5]   robus_ic(b_status!1, r_status!1,
               F!1`BB(b1!1)(G!1), F!1`RB(b1!1))
        (G!1, msg!1, b1!1)
       =
      robus_ic(b_status!1, r_status!1,
               F!1`BB(b2!1)(G!1), F!1`RB(b2!1))
        (G!1, msg!1, b2!1)

Fig. 4: The Unproven PVS Sequent

6.2 Model Checking the Sequent

We use the Symbolic Analysis Laboratory
(SAL) [24,11], also developed by SRI International,
to model check the protocol against the
undischarged sequent. SAL is a family of model
checkers that includes symbolic, bounded, and
explicit-state model checkers, among other tools.
The SAL language includes constructs such as re- 
cursive function definition, synchronous and asyn- 
chronous composition operators, and quantifiers 
over finite types. We particularly exploit the quan- 
tifier, recursive function, and synchronous compo- 
sition constructs. 

Our SAL model builds on the model of Oral 
Messages that is explained in detail in Rushby’s 
SAL tutorial [17]. Our model differs slightly as 
we must represent the local diagnoses data of 
each node, the Diagnostic Assumptions, and the
IC DMFA, which is parametrized by the local di- 
agnoses. Furthermore, we state these constraints 
explicitly rather than embedding them into the 
system model. We found this makes our model 
more perspicuous. 

A sequent can be read as stating that if the 
conjunction of the antecedent statements is true, 
then the disjunction of the consequent statements 
is true. That is, if A is the set of antecedents and 
C is the set of consequents, a sequent is equivalent 
to the conditional 

$\bigwedge A \Rightarrow \bigvee C$. (1)

This formulation is used to express the sequent 
in SAL and appears in Figure 5. There, SYSTEM 
denotes the model of the ROBUS IC Protocol de- 
veloped in the model checker, the symbol I - de- 
notes the purported satisfaction relation between 
the model and G is the global-state operator of 
LTL (not to be confused with the denotation of 
the General). 

SAL has an imperative language, so some of 
the predicates in the PVS sequent have been ex- 
pressed equationally. Some of the functions of PVS 
have been converted to arrays in SAL, giving rise 
to the bracket notation.

counterex: THEOREM SYSTEM |-
  G((pc = 4 AND
     r_status[1] = good AND
     G_status = asymmetric AND
     IC_DMFA(r_status, F_RB, F_BR, G_status) AND
     all_correct_accs(r_status, F_RB,
                      G_status, F_BR, F_BB))
    =>
    (F_BR[1] = trusted OR
     F_BB[2] = declared OR
     (FORALL (r: RMUs): F_RB[1][r] = trusted =>
                        r_status[r] /= asymmetric AND
      FORALL (r: RMUs): F_RB[2][r] = trusted =>
                        r_status[r] /= asymmetric) OR
     F_BB[1] = declared OR
     robus_ic[1] = robus_ic[2]));

Fig. 5: The SAL Formulation of the Undischarged
Sequent

Two additional statements in the LTL formu- 
lation are artifacts of how the protocol is mod- 
eled in the model checker, both of which come 
from Rushby’s original formulation. First, there 
is a program counter pc that represents which 
round of the protocol is currently executing. These 
rounds correspond to the three rounds described 
in Section 2.1. When pc = 4, the last round has 
completed. The second artifact is the imperative 
definition of the result of the ROBUS IC Protocol 
using the array called robus_ic. 

Thus, the conjecture in Figure 5 can be read 
as stating that in every state reachable from the 
initial state of SYSTEM, the formulation of the un- 
proven sequent described above is true. 

A counterexample to the formula in Figure 5 
is a reachable state in which the formula is false. 
As mentioned, that formula is derived from the 
conditional interpretation of a sequent in (1). The 
negation of (1) is equivalent to 

$\bigwedge A \wedge \bigwedge \overline{C}$, (2)

where $\overline{C}$ denotes the negation of each formula in
C. A counterexample is therefore a reachable state
in which (2) is true. Such a state matches the in- 
formal description of the counterexample in Sec- 
tion 5. 
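
The relationship between the conditional reading of a sequent and its negation can be checked exhaustively for a sequent of the same shape as Figure 4. The Python sketch below is illustrative; the function names `sequent_holds` and `is_counterexample` are hypothetical, and formulas are represented only by their truth values in a given state.

```python
from itertools import product


def sequent_holds(antecedents, consequents):
    """Formula (1): the conjunction of antecedents implies the disjunction of consequents."""
    return (not all(antecedents)) or any(consequents)


def is_counterexample(antecedents, consequents):
    """Formula (2): every antecedent holds and every consequent is false."""
    return all(antecedents) and all(not c for c in consequents)


# Confirm that (2) is the negation of (1) for every truth assignment to a
# sequent with four antecedents and five consequents, as in Figure 4.
for bits in product([False, True], repeat=9):
    a, c = bits[:4], bits[4:]
    assert is_counterexample(a, c) == (not sequent_holds(a, c))
```

Only one of the 512 assignments is a counterexample, which is why the state a model checker must reach is so tightly constrained.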

Using SAL’s symbolic model checker on a sys- 
tem with one gigabyte of memory and an AMD 
Athlon 2000+ processor, a counterexample like 
the one described in Section 5 is discovered in 
about 16 seconds for three RMUs and three BIUs, 
including the General. One may wonder whether 
this counterexample arises from the system hav- 
ing too few RMUs to relay messages. Increasing 
the number of RMUs quickly overwhelms the sym- 
bolic model checker. However, we obtain a simi- 
lar counterexample using SAL’s bounded model 
checker for seven RMUs in a little over two min- 
utes on the same system. 
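
To illustrate what instantiating the parameters with small constants buys, the following Python sketch enumerates behaviours explicitly for a fixed number of RMUs. It is not the SAL model and does not use symbolic or bounded techniques: the function `find_violation` is hypothetical, the diagnoses are fixed in advance (both good BIUs trust every RMU, and good RMUs relay the General's message as in the flawed step 2), the vote is assumed to be a strict majority, and at least three RMUs are assumed so that the first IC DMFA clause holds.

```python
from collections import Counter
from itertools import product

SOURCE_ERROR = "source_error"


def majority(values):
    """Return the strict-majority value, or source_error if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else SOURCE_ERROR


def find_violation(num_rmus, messages=("v", "u")):
    """Search all behaviours of an asymmetric General and asymmetric RMU 0.

    RMUs 1..num_rmus-1 are good and relay whatever the General sent them.
    Returns (general_sends, rmu0_sends, outputs) for the first behaviour in
    which the two good BIUs disagree, or None if there is none.
    """
    for general_sends in product(messages, repeat=num_rmus):
        relayed = list(general_sends[1:])        # good RMUs relay faithfully
        for rmu0_sends in product(messages, repeat=2):
            outputs = tuple(majority([m] + relayed) for m in rmu0_sends)
            if outputs[0] != outputs[1]:
                return general_sends, rmu0_sends, outputs
    return None


print(find_violation(3))
print(find_violation(7) is not None)
```

Even this naive enumeration grows as $2^{R+2}$ in the number of RMUs R for two message values, before any diagnoses are left free.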


These concrete counterexamples demonstrate 
that the unproved sequent cannot be discharged 
because the protocol itself has an error. Changing 
the PVS and SAL models to include the fix sug- 
gested in Section 5 allows the Agreement proof to 
be completed [21], and SAL verifies the formula 
in Figure 5 in a model using the same number of 
BIUs and RMUs as used to find the counterexam- 
ple (the fix is included as commented code in the 
SAL model available on-line; see footnote 3).

6.3 Remarks on our Approach 

In our approach, we manually model the protocol 
and the requirements both in PVS and SAL. This 
is simultaneously advantageous and disadvanta- 
geous. Having to model the protocol and require- 
ments in distinct languages provides an additional 
guard against modeling errors in each language. 
In particular, we wish to guard against false neg- 
atives, which are particularly easy to generate in 
model checkers. 

A disadvantage is the additional work required 
to model the protocol and requirements twice. 
This approach is not feasible to check numerous 
failed proof conjectures in a proof attempt. 

Some limitations of this approach are inherent 
to the limitations of symbolic model checking, in 
general. A model checker is useful when the system
can be modeled as a finite-state machine,
and the requirements to be proved can be
modeled in a temporal logic. As well, a counterex- 
ample may exist, but be beyond the computa- 
tional limits of the model checker and the com- 
puter on which it executes. 

Despite these limitations, we believe this ap- 
proach approximates the current state-of-the-art 
to verify and discover bugs for a protocol like the 
one described. 

Related Work Much previous work that inte- 
grates model checking and interactive theorem 
proving has focused on using model checking to 
automate proving rather than on disproving [25, 
26]. Some theorem provers have embedded model 
checkers (both PVS and ACL2 contain embedded
μ-calculus model checkers [27,28]). Most related
to our work is a study in which resolution-based
theorem proving and model checking are
used to discover counterexamples to proof obligations
[29]. Our work differs in that we present a
reasonably intricate protocol whereas a small illus- 
trative example is presented therein. As well, the 
focus therein is on automated theorem proving; 
our focus is on using model checking to facilitate 
interactive theorem proving.

7 The Challenge

The challenge is as follows: 

From a parameterized specification of the 
protocol (from which a general proof can 
be obtained), provide a concrete instance 
of the bug in a way that requires as little 
effort from the user as possible.

This section describes how our approach could be 
improved as well as speculation about other ap- 
proaches. 

Specification Languages In our specific case, the 
specification languages of PVS and SAL are sim- 
ilar, and it is a goal of SRI to develop translators 
between them [30]. Building a translator between
the languages is not trivial as the languages over- 
lap, but neither is a subset of the other. The lack 
of a translator required us to model the protocol 
twice. 

More generally, efficient disproving requires bet- 
ter integration between proving tools (e.g., the- 
orem provers) and disproving tools (e.g., model 
checkers) [31, 32], regardless of which theorem
prover or model checker is chosen. As noted in
Section 6, some theorem provers contain model
checkers, but it is not clear that these model
checkers can handle this case study.

Parameter Interpretation Even if a translator ex- 
isted, the parameterized PVS specification is not 
immediately amenable to finite-state model check- 
ing: the parameters must be interpreted. For a sys- 
tem with many parameters or a large specification, 
the interpretation is tedious if done manually. The 
tedium is compounded by the need to find param- 
eters small enough to make model checking feasi- 
ble yet large enough to expose the counterexam- 
ple. 
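
The search for parameters that are small enough
to check yet large enough to expose a bug can be
sketched as a smallest-first loop over
instantiations. The sketch below is illustrative
only. The majority-vote toy model, the deliberately
too-weak assumption n >= 2f, and all function names
are assumptions made for this example; none of it
is the SPIDER IC Protocol or its PVS and SAL models,
and an exhaustive Python enumeration stands in for
the model checker.

```python
from collections import Counter
from itertools import product


def majority(values):
    """Return the value held by a strict majority, or None if there is none."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def check_instance(n, f, domain=(0, 1)):
    """Exhaustively check one instantiation: n senders, f of them faulty.

    Good senders all send 1; faulty senders may send anything in `domain`.
    Returns a violating list of received values, or None if none exists.
    """
    good = [1] * (n - f)
    for faulty in product(domain, repeat=f):
        received = good + list(faulty)
        if majority(received) != 1:
            return received
    return None


def find_smallest_counterexample(max_n):
    """Try instantiations smallest-first under the (too weak) assumption n >= 2f."""
    for n in range(1, max_n + 1):
        for f in range(0, n // 2 + 1):  # every f with n >= 2f
            trace = check_instance(n, f)
            if trace is not None:
                return n, f, trace
    return None


print(find_smallest_counterexample(6))  # (2, 1, [1, 0])
```

Ordering the instantiations by size means that the
first counterexample found is also the smallest,
which keeps both the checking cost and the trace
the user must read as short as possible.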

A quickcheck approach, like the one that ap- 
pears in Isabelle [33], may be sufficient to demon- 
strate a counterexample. However, doing so re- 
quires a reformulation of the problem since our 
formulation in PVS uses non-executable con- 
structs (e.g., quantifiers and Hilbert’s choice op- 
erator). How this could be done for a parame- 
terized specification of the protocol that is not 
implementation-specific is unknown to the au- 
thors. 
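
To make the reformulation concrete, the following
sketch shows the kind of rewriting a quickcheck-style
tool needs: a universal quantifier becomes a bounded
`all(...)`, and a choice operator becomes a
deterministic pick. It is a minimal illustration
under stated assumptions. The toy agreement property,
the too-weak assumption n >= 2f, the fixed instance
bounds, and every name in the code are hypothetical;
this is neither the PVS formulation nor Isabelle's
quickcheck.

```python
import random
from collections import Counter


def choose(candidates, pred):
    """Executable stand-in for a choice operator: first candidate satisfying pred."""
    return next((c for c in candidates if pred(c)), None)


def vote(received):
    """Strict-majority value among the received values, or None if there is none."""
    counts = Counter(received)
    return choose(sorted(counts), lambda v: 2 * counts[v] > len(received))


def agreement(views):
    """Bounded form of 'for all receivers i, j: vote(i) = vote(j)'."""
    votes = [vote(view) for view in views]
    return all(v == votes[0] for v in votes)


def random_instance(rng, max_n=5, receivers=2):
    """Random test case satisfying the (too weak) assumption n >= 2f."""
    n = rng.randint(1, max_n)
    f = rng.randint(0, n // 2)
    # Good senders send 1 to everyone; faulty senders may differ per receiver.
    return [[1] * (n - f) + [rng.randint(0, 1) for _ in range(f)]
            for _ in range(receivers)]


def quickcheck(prop, gen, trials=1000, seed=0):
    """Return the first generated instance that falsifies prop, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        instance = gen(rng)
        if not prop(instance):
            return instance
    return None


counterexample = quickcheck(agreement, random_instance)
print(counterexample)
```

Note that the instance bounds are fixed inside the
generator, so the sketch sidesteps rather than
answers the open question of how to test a fully
parameterized specification.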

Proof Calculus Translation A minor but relevant 
task in our approach is the translation of a failed 
conjecture in the sequent calculus into an LTL
formula. We hope that this task can be automated,
too (for whatever proof calculus and temporal
logic are used).
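
One plausible shape for such a translation is
sketched below. A sequent with antecedents A1..Am
and consequents C1..Cn is read as the implication
(A1 AND ... AND Am) => (C1 OR ... OR Cn). Wrapping
that implication in G, so that it is asserted in
every reachable state, is an assumption that only
makes sense when the sequent formulas are state
predicates. The function name, the SAL-like
surface syntax, and the example predicates are
hypothetical.

```python
def sequent_to_ltl(antecedents, consequents):
    """Render the sequent  A1, ..., Am |- C1, ..., Cn  as an LTL invariant.

    The sequent is read as (A1 AND ... AND Am) => (C1 OR ... OR Cn).
    An empty antecedent list is TRUE; an empty consequent list is FALSE.
    """
    lhs = " AND ".join(f"({a})" for a in antecedents) or "TRUE"
    rhs = " OR ".join(f"({c})" for c in consequents) or "FALSE"
    return f"G(({lhs}) => ({rhs}))"


print(sequent_to_ltl(["good(src)", "phase = 2"], ["vote = sent"]))
# G(((good(src)) AND (phase = 2)) => ((vote = sent)))
```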


Proving and Disproving Ideally, both proving and 
disproving would be automatic. Automated dis- 
proving appears to be an easier challenge to meet 
than proving. That said, recent advances in satis- 
fiability modulo theories (SMT) provers [34] hold 
promise. Fault-tolerant protocols have generally 
been good candidates for mechanical theorem 
proving given their criticality and complexity, but 
SMT technology may provide a more automated 
approach. Currently, SMT provers are not capable
of proving a fully parameterized specification of
the protocol presented, but recent applications of
SMT to verifying fault-tolerant and real-time
protocols are promising [35, 20]. For these sorts of
verification, SMT is particularly useful when
combined with bounded model checking to do highly
automated induction proofs of safety properties
over infinite-state systems [36]. We hope the SMT
community also takes up the protocol presented
as a challenge problem in parameterized proof as
well as counterexample generation.
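
The interplay between bounded model checking and
induction can be illustrated with k-induction, one
common way of combining the two. The sketch below
is a toy under stated assumptions: it enumerates
paths of a small finite transition system
explicitly, whereas an SMT-based tool would pose
the same base and step queries symbolically over
an infinite-state model. The transition system and
all names are hypothetical and unrelated to the
SPIDER models.

```python
def paths(succ, starts, length):
    """All paths with `length` states that begin in one of `starts`."""
    frontier = [[s] for s in starts]
    for _ in range(length - 1):
        frontier = [p + [t] for p in frontier for t in succ(p[-1])]
    return frontier


def k_induction(states, init, succ, prop, max_k):
    """Try to prove 'prop holds in every reachable state' by k-induction.

    Returns ("proved", k), ("counterexample", path), or ("unknown", max_k).
    """
    for k in range(1, max_k + 1):
        # Base case (bounded model checking): no path of k states from an
        # initial state ends in a state violating prop. Shorter paths were
        # already checked in earlier iterations.
        for path in paths(succ, init, k):
            if not prop(path[-1]):
                return "counterexample", path
        # Inductive step: k consecutive prop-states, starting anywhere in
        # the state space, are always followed by a prop-state.
        step_holds = all(
            prop(path[-1])
            for path in paths(succ, states, k + 1)
            if all(prop(s) for s in path[:-1])
        )
        if step_holds:
            return "proved", k
    return "unknown", max_k


def safe(s):
    return s != 6  # hypothetical safety property: state 6 is never reached


# States 3 and 5 are unreachable from 0 but lead to the bad state 6,
# so plain (k = 1) induction fails and a deeper step is needed.
succ_map = {0: [2], 2: [0], 3: [5], 5: [6], 6: [6]}
print(k_induction(list(succ_map), [0], lambda s: succ_map[s], safe, 5))
# ('proved', 3)

# A flawed variant in which state 2 can also move to state 5.
buggy_map = {0: [2], 2: [0, 5], 3: [5], 5: [6], 6: [6]}
print(k_induction(list(buggy_map), [0], lambda s: buggy_map[s], safe, 5))
# ('counterexample', [0, 2, 5, 6])
```

The same loop therefore serves both goals: the base
case yields a concrete counterexample when the
property is false, and the step case yields a proof
when it is true and k is large enough.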

We do not believe that automated first-order 
theorem provers alone can prove the correctness 
of the SPIDER IC Protocol. The protocol’s proof
of correctness is parameterized in the number 
of BIUs and RMUs. Furthermore, the proof of 
correctness depends on reasoning about nontriv- 
ial mathematical facts (e.g., the IC DMFA). A 
first-order theorem prover could possibly be used 
to derive a counterexample for a fixed specifica- 
tion, replacing the model checker in our approach. 
Nonetheless, a similar interactive consistency pro- 
tocol has been specified and verified in ACL2, an 
interactive first-order theorem prover [14]. Some 
initial work has been done to translate the PVS 
specification of the SPIDER IC Protocol into a 
form suitable for SAT solving [37]. 


Effort Requirements Given the criticality of the
correct design of SPIDER and similar safety- 
critical embedded systems, their designers are 
willing to invest a great deal of effort to gain high 
assurance of their correctness. Therefore, the ac- 
ceptable level of effort required to obtain coun- 
terexamples is relatively high, as evidenced by 
the use of interactive theorem proving in the first 
place. Certainly the upper bound on the accept- 
able level of effort to uncover a counterexample is 
the time it takes a user to diagnose the error in the 
failed proof attempt in the theorem prover. This 
effort varies according to experience with the the- 
orem prover, expertise in the domain, the proof 
infrastructure, and the specific reason the proof 
has failed. 



8 Conclusion 

We have presented a subtly flawed fault-tolerant
protocol, its attempted verification by theorem 
proving, and our use of a model checker to demon- 
strate that a bug in the protocol prevents us from 
completing the verification. As noted, we believe 
our approach approximates the best possible with 
today’s technology. 

More generally, a variety of real-time fault- 
tolerant protocols have been designed for SPIDER 
and are described in detail [4]. Most of these pro- 
tocols have been verified using theorem proving 
and model checking with SMT [38,21,20]. These 
protocols are all suitable case studies to demon- 
strate novel verification tools and techniques. 